The Evolution of MLLM Architectures: From Vision-Centric to Multi-Sensory Integration

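The evolution of multimodal large language models (MLLMs) marks a shift from modality-specific silos to a unified representation space, in which non-text signals (images, audio, 3D data) are translated into a form the language model can understand.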

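1. From Vision to Multi-Sensory

Early MLLMs: Focused mainly on Vision Transformers (ViT) for image-text tasks.
Modern architectures: Integrate audio encoders (e.g., HuBERT, Whisper) and 3D point-cloud encoders (e.g., Point-BERT) to achieve genuinely cross-modal intelligence.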

2. The Projection Bridge

A mathematical bridge is needed to connect each modality to the LLM:

Linear projection: A simple mapping used in early models such as MiniGPT-4.
$$X_{llm} = W \cdot X_{modality} + b$$
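Multi-layer MLP: A two-layer design (e.g., LLaVA-1.5) whose nonlinear transformation aligns complex features better than a single linear layer.
Resamplers/Abstractors: More advanced modules that compress high-dimensional data into a fixed number of tokens, such as the Perceiver Resampler (Flamingo) or the Q-Former.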
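The three bridge types above can be sketched in a few lines of NumPy. This is a minimal illustration, not the implementation used by any of the named models: the sizes, the random weights, and the function names (linear_projection, mlp_projection, resample) are assumptions made for the example, and the resampler is reduced to a single-head cross-attention with no learned key/value projections.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: 50 encoder tokens of width 64, LLM width 128, 8 queries.
n_tokens, d_modality, d_llm, n_queries = 50, 64, 128, 8
x = rng.standard_normal((n_tokens, d_modality))  # stand-in encoder output


def linear_projection(x, W, b):
    """X_llm = W . X_modality + b, applied to every token (row) of x."""
    return x @ W.T + b


def gelu(z):
    """tanh approximation of the GELU activation."""
    return 0.5 * z * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (z + 0.044715 * z**3)))


def mlp_projection(x, W1, b1, W2, b2):
    """Two-layer MLP: linear -> GELU -> linear."""
    return gelu(x @ W1.T + b1) @ W2.T + b2


def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # subtract max for stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def resample(x, queries):
    """Simplified resampler: a fixed set of queries cross-attends over the
    input tokens, so the output length is len(queries) whatever len(x) is."""
    scores = queries @ x.T / np.sqrt(x.shape[-1])  # (n_queries, n_tokens)
    return softmax(scores) @ x                     # (n_queries, d_modality)


W = rng.standard_normal((d_llm, d_modality)) * 0.02
b = np.zeros(d_llm)
linear_out = linear_projection(x, W, b)            # (50, 128)

W1 = rng.standard_normal((d_llm, d_modality)) * 0.02
W2 = rng.standard_normal((d_llm, d_llm)) * 0.02
mlp_out = mlp_projection(x, W1, np.zeros(d_llm), W2, np.zeros(d_llm))  # (50, 128)

queries = rng.standard_normal((n_queries, d_modality))  # learned in practice
resampled = resample(x, queries)                   # (8, 64): fixed length
resampled_llm = linear_projection(resampled, W, b) # (8, 128)

print(linear_out.shape, mlp_out.shape, resampled_llm.shape)
```

Note the difference in output length: the linear and MLP projections keep one output token per input token, while the resampler always emits n_queries tokens, which is what keeps the LLM's input length bounded for long audio or video.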

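3. Decoding Strategies

Discrete tokens: Represent the output as entries from a fixed vocabulary (e.g., VideoPoet).
Continuous embeddings: Use "soft" signals to guide a specialized downstream generator (e.g., NExT-GPT).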
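The contrast between the two strategies can be shown on the same set of LLM output states. The sketch below is illustrative only: the sizes, the random weights, and the greedy argmax decoding are assumptions for the example, not how VideoPoet or NExT-GPT are actually configured.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: LLM width 16, codebook of 10 entries, generator width 8.
d_llm, codebook_size, d_cond = 16, 10, 8

# Stand-in for the LLM's hidden states at 4 output positions.
hidden = rng.standard_normal((4, d_llm))

# Discrete tokens: score every codebook entry and pick one id per position.
# A separate detokenizer would later turn these ids back into pixels or audio.
W_vocab = rng.standard_normal((codebook_size, d_llm))
logits = hidden @ W_vocab.T               # (4, codebook_size)
token_ids = logits.argmax(axis=-1)        # greedy choice, shape (4,)

# Continuous embeddings: map the same states to "soft" conditioning vectors
# that a downstream generator (for example a diffusion model) would consume.
W_out = rng.standard_normal((d_cond, d_llm)) * 0.1
soft_signal = hidden @ W_out.T            # (4, d_cond)

print(token_ids.shape, soft_signal.shape)
```

The discrete path loses information at the argmax but fits the LLM's usual next-token training, while the continuous path keeps the full vector at the cost of needing a generator trained to read it.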

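The Projection Rule

For an LLM to process audio or a 3D object, the signal must first be projected into the model's existing semantic space, so that it is interpreted as a "modality signal" rather than mistaken for noise.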

Question 1

Which projection technique is generally considered superior to a simple Linear layer for complex modality alignment?

Token Dropping

Two-layer MLP or Resamplers (e.g., Q-Former)

Softmax Activation

Linear Projection

Question 2

What is the primary role of ImageBind or LanguageBind in this architecture?

To generate text from images

To compress video files

To create a Unified/Joint representation space for multiple modalities

To increase the LLM context window

Challenge: Designing an Any-to-Any System

Diagram the flow for an MLLM that takes an Audio input and generates a 3D model.

You are tasked with architecting a pipeline that allows an LLM to "listen" to an audio description and output a corresponding 3D object. Define the three critical steps in this pipeline.

Step 1

Select the correct encoder for the input signal.

Solution:
Use an audio encoder such as Whisper or HuBERT to transform the raw audio waveform into feature vectors.

Step 2

Apply a Projection Layer.

Solution:
Pass the audio feature vectors through a two-layer MLP or a Resampler to align them with the LLM's internal semantic space (dimension matching).

Step 3

Generate and Decode the output.

Solution:
The LLM processes the aligned tokens and outputs "Modality Signals" (continuous embeddings or discrete tokens). These signals are then passed to a 3D-specific decoder (e.g., a 3D Diffusion model) to generate the final 3D object.
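The three steps can be wired together as a toy pipeline to check that the shapes line up. Everything here is a stand-in under stated assumptions: encode_audio, llm_stub, and decode_3d are hypothetical placeholders with random weights, not Whisper, a real LLM, or a 3D diffusion model, and the continuous-embedding route is chosen for Step 3.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes chosen only for illustration.
FRAME, D_AUDIO, D_LLM, N_POINTS = 400, 32, 64, 128


def encode_audio(waveform):
    """Step 1 stand-in: cut the waveform into frames, one feature vector each."""
    n_frames = len(waveform) // FRAME
    frames = waveform[: n_frames * FRAME].reshape(n_frames, FRAME)
    W_enc = rng.standard_normal((FRAME, D_AUDIO)) / np.sqrt(FRAME)
    return frames @ W_enc                        # (n_frames, D_AUDIO)


def project(features):
    """Step 2: two-layer MLP that maps features to the LLM's width."""
    W1 = rng.standard_normal((D_AUDIO, D_LLM)) / np.sqrt(D_AUDIO)
    W2 = rng.standard_normal((D_LLM, D_LLM)) / np.sqrt(D_LLM)
    return np.maximum(features @ W1, 0.0) @ W2   # ReLU between the layers


def llm_stub(tokens):
    """Step 3a stand-in: emit one continuous 'modality signal' vector."""
    return np.tanh(tokens.mean(axis=0))          # (D_LLM,)


def decode_3d(signal):
    """Step 3b stand-in for a 3D generator: signal -> point cloud."""
    W_dec = rng.standard_normal((D_LLM, N_POINTS * 3)) / np.sqrt(D_LLM)
    return (signal @ W_dec).reshape(N_POINTS, 3)


waveform = rng.standard_normal(16000)            # synthetic 1 s clip at 16 kHz
features = encode_audio(waveform)
tokens = project(features)
signal = llm_stub(tokens)
points = decode_3d(signal)

print(features.shape, tokens.shape, signal.shape, points.shape)
# (40, 32) (40, 64) (64,) (128, 3)
```

In a real system each stand-in would be replaced by a trained component, and typically only the projection layer and the output adapter are trained while the encoder, LLM, and generator stay mostly frozen.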